Artificial Neural Networks

Neural network playground

Basic working mechanism

artificial-nn-1.png|600

How it works

  1. Randomly initialise the weights to small numbers close to 0 (but not 0).
  2. Input the first observation of your dataset into the input layer, one feature per input node.
  3. Forward-propagation: from left to right, the neurons are activated in such a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result y.
  4. Compare the predicted result to the actual result. Measure the generated error.
  5. Back-propagation: from right to left, the error is propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.
  6. Repeat Steps 1 to 5 and update the weights after each observation (Stochastic / Online Learning); OR repeat Steps 1 to 5 but update the weights only after a batch of observations (Batch Learning).
  7. When the whole training set has passed through the ANN, that makes one epoch. Redo more epochs. A minimal code sketch of this loop is given right after this list.
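
As a concrete illustration of Steps 1–7, here is a minimal NumPy sketch of a one-hidden-layer network trained with batch updates. The data, layer sizes, learning rate, and number of epochs are all illustrative assumptions, not values from the notes above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 100 observations, 3 features, binary target (illustrative only).
X = rng.normal(size=(100, 3))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# Step 1: randomly initialise the weights to small numbers close to 0.
W1 = rng.normal(scale=0.01, size=(3, 4))   # input layer -> hidden layer (4 neurons)
W2 = rng.normal(scale=0.01, size=(4, 1))   # hidden layer -> output layer
lr = 0.1                                   # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):                    # Step 7: redo more epochs
    # Steps 2-3: forward-propagation over the whole batch (Batch Learning).
    h = sigmoid(X @ W1)                    # hidden activations
    y_pred = sigmoid(h @ W2)               # predicted result y

    # Step 4: compare the prediction to the actual result (squared error here).
    error = y_pred - y

    # Step 5: back-propagation -- propagate the error from right to left.
    grad_out = error * y_pred * (1 - y_pred)        # gradient at the output layer
    grad_W2 = h.T @ grad_out
    grad_hidden = (grad_out @ W2.T) * h * (1 - h)   # chain rule through the hidden layer
    grad_W1 = X.T @ grad_hidden

    # Update the weights in proportion to their responsibility for the error.
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
```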

Epoch vs. Batch
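One epoch is one full pass of the training set through the network; a batch is the subset of observations after which the weights are updated once. In a high-level library such as Keras these are simply the `epochs` and `batch_size` arguments of `fit`. A hedged sketch with made-up data (model size, batch size, and epoch count are arbitrary assumptions):

```python
import numpy as np
from tensorflow import keras

# Hypothetical data with 3 features and a binary target, just to make the call runnable.
X = np.random.rand(100, 3)
y = (X.sum(axis=1) > 1.5).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(3,)),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# batch_size: update the weights after every 10 observations.
# epochs: pass the whole training set through the network 20 times.
model.fit(X, y, batch_size=10, epochs=20)
```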

Activation functions

There are multiple common forms:

Sigmoid: $\phi(z) = \frac{1}{1 + e^{-z}}$

Pasted image 20230309212108.png|300

Hyperbolic tangent (tanh): $\phi(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

Pasted image 20230309212317.png|300

Softmax: $\phi(z_i) = \mathrm{Softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$

Why "soft": because it generates values between 0 and 1 and sum to 1, so the max value is a probability smaller than 1, instead of "hard" max value as 1 and the rest as 0.

Rectified Linear Unit (ReLU): $\phi(z) = \max(z, 0)$

An alternative (often better) variant of ReLU is Leaky ReLU, which replaces the flat 0 for negative inputs with a small slope: $\phi(z) = \max(\alpha z, z)$ with a small $\alpha$ (e.g. 0.01).

Linear: $\phi(z) = az + b$

Hyperbolic tangent (alternative form): $\phi(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}}$
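
The activation functions above are one-liners in NumPy; a minimal sketch (the 0.01 slope for Leaky ReLU is an assumed, typical default):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # same as np.tanh(z)

def softmax(z):
    e = np.exp(z - z.max())          # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):       # alpha is an assumed, typical small slope
    return np.maximum(alpha * z, z)
```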

Which one to choose depends on your problem: ReLU (or Leaky ReLU) is a common default for hidden layers, while sigmoid or softmax is typically used in the output layer for classification.

Why is ReLU faster than sigmoid-like activation functions?

Because minimizing the loss function requires Notes/Gradient Descent, whose speed depends on the derivative of the activation function $\phi(z)$. The derivatives of both the sigmoid and tanh functions become very small when $|z|$ is large (the functions saturate); in other words, gradient descent becomes very slow. This doesn't happen with ReLU, because its derivative is a constant 1 when $z > 0$.
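
A quick numerical check of this saturation effect (the sample points are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)               # derivative of the sigmoid

def relu_grad(z):
    return (z > 0).astype(float)     # derivative of ReLU: 1 for z > 0, else 0

z = np.array([0.0, 2.0, 5.0, 10.0])  # made-up sample points
print(sigmoid_grad(z))               # [0.25, 0.105, 0.0066, 0.000045] -- shrinks toward 0
print(relu_grad(z))                  # [0., 1., 1., 1.] -- stays constant for z > 0
```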